import torch
import pandas as pd
import numpy as np
from datasets import load_dataset, Dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
Syllogism Conclusion Generator with GPT-2
Given two premises that form a valid syllogism, this autoregressive model can accurately complete the syllogism by generating a conclusion.
Introduction
This notebook extends the previous one, which classified whether two premises could be used to generate a valid conclusion. That model used a BERT architecture and was fine-tuned on the Avicenna syllogism dataset. This notebook uses the same dataset, but instead fine-tunes a GPT-2 model to take two premises as input and generate the corresponding conclusion.
I had to define custom start, end, and pad tokens so that each input could be padded to the same length, rather than being forced to chop syllogisms into arbitrary pieces to create input blocks of equal size.
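To illustrate the trade-off between padding and chunking, here is a minimal sketch on toy token ids. The ids, the `PAD` value, and both helper names are hypothetical and only stand in for real tokenizer output; the point is that padding keeps each syllogism intact, while cutting a concatenated stream into blocks can split one across two inputs.

```python
# Toy token ids standing in for three tokenized syllogisms (hypothetical values).
PAD = 0  # hypothetical pad id
examples = [[5, 6, 7, 2], [8, 9, 2], [3, 4, 5, 6, 7, 2]]

def pad_to_length(ids, max_length, pad_id=PAD):
    """Truncate to max_length, then right-pad so every row has the same size."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

def chunk_stream(rows, block_size):
    """Alternative: concatenate all rows and cut fixed-size blocks."""
    stream = [tok for row in rows for tok in row]
    usable = len(stream) - len(stream) % block_size  # drop the ragged tail
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

padded = [pad_to_length(e, 6) for e in examples]
chunked = chunk_stream(examples, 6)

print(padded)   # one row per example, each example kept whole
print(chunked)  # blocks may start or end in the middle of an example
```

With padding, each row still corresponds to exactly one example; with chunking, the second block begins with the tail of one example and continues into the next.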
device = 'cuda' if torch.cuda.is_available() else 'cpu'
file_name = 'Avicenna_Train.csv'
model_cp = "gpt2"
max_length = 200
tokenizer = GPT2Tokenizer.from_pretrained(model_cp, bos_token='<startoftext>',
                                          eos_token='<endoftext>', pad_token='<pad>')
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False, return_tensors='pt')
model = GPT2LMHeadModel.from_pretrained(model_cp).to(device)
model.resize_token_embeddings(len(tokenizer))
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Embedding(50260, 768)
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=max_length, padding='max_length')
The data came in a CSV file containing premise 1, premise 2, validity, and conclusion. I needed to filter the dataset to remove all invalid syllogisms and then combine the premises and conclusion into a single string for fine-tuning. I found that telling the model which premise was which and where the conclusion started improved training. Additionally, adding a $ after the last premise slightly improved training, but this was done mainly to replicate what was done in the original GPT paper.
def prepare_dataset(file_name):
    dataset = load_dataset('csv', data_files=file_name, sep=',', encoding='ISO-8859-1')
    dataset.set_format(type='pandas')
    df = dataset['train'][:]
    # Keep only the valid syllogisms
    df = df[df['Syllogistic relation'] == 'yes']
    df['text'] = '<startoftext>' + 'Premise 1: ' + df['Premise 1'] + 'Premise 2:' + df['Premise 2'] + '$' + 'Conclusion:' + df['Conclusion'] + '<endoftext>'
    df.reset_index(drop=True, inplace=True)
    df = df[['text']]
    dataset = Dataset.from_pandas(df)
    dataset = dataset.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])
    return dataset

train_dataset = prepare_dataset('Avicenna_Train.csv')
test_dataset = prepare_dataset('Avicenna_Test.csv')
Using custom data configuration default-9e35c2288d530357
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-9e35c2288d530357/0.0.0/51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58)
Using custom data configuration default-80959f65edc13f7a
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-80959f65edc13f7a/0.0.0/51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58)
tokenizer.decode(train_dataset['input_ids'][0])
'<startoftext> Premise 1: Chronic diseases are heart attacks and stroke, cancer such as breast and colon cancer, diabetes, epilepsy and seizures, obesity, and oral health problems.Premise 2:In populations that eat a regular high-fiber diet of more than 50 grams of fiber per dayTrusted Source, like rural South Africans, chronic diseases are very low.$Conclusion:In populations that eat a regular high-fiber diet of more than 50 grams of fiber per dayTrusted Source, like rural South Africans, heart attacks and stroke, cancer such as breast and colon cancer, diabetes, epilepsy and seizures, obesity, and oral health problems are very low. <endoftext> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>'
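To make the training template concrete, here is a small sketch that builds the string for a single hand-written row. The helper name `format_example` is hypothetical; it simply repeats the string pieces used in `prepare_dataset`, so the pieces are joined with no extra spaces, as in the decoded example above.

```python
def format_example(premise_1, premise_2, conclusion):
    """Build one training string from the same pieces as the notebook's template."""
    return ('<startoftext>' + 'Premise 1: ' + premise_1
            + 'Premise 2:' + premise_2
            + '$' + 'Conclusion:' + conclusion
            + '<endoftext>')

text = format_example('All men are mortal.',
                      'Socrates is a man.',
                      'Socrates is mortal.')
print(text)
```

Everything up to and including `Conclusion:` acts as the prompt, and the model learns to continue it with the conclusion followed by the end token.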
Training
model_name = model_cp.split("/")[-1]
training_args = TrainingArguments(
    f"{model_cp}-finetuned-syllogism",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
)
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator
)
trainer.train()
/usr/local/lib/python3.9/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
***** Running training *****
Num examples = 2427
Num Epochs = 3
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 912
Epoch | Training Loss | Validation Loss |
---|---|---|
1 | No log | 2.311477 |
2 | 2.267000 | 2.312688 |
3 | 2.267000 | 2.314481 |
***** Running Evaluation *****
Num examples = 630
Batch size = 8
Saving model checkpoint to gpt2-finetuned-syllogism/checkpoint-500
Configuration saved in gpt2-finetuned-syllogism/checkpoint-500/config.json
Model weights saved in gpt2-finetuned-syllogism/checkpoint-500/pytorch_model.bin
***** Running Evaluation *****
Num examples = 630
Batch size = 8
***** Running Evaluation *****
Num examples = 630
Batch size = 8
Training completed. Do not forget to share your model on huggingface.co/models =)
TrainOutput(global_step=912, training_loss=2.215385637785259, metrics={'train_runtime': 258.8719, 'train_samples_per_second': 28.126, 'train_steps_per_second': 3.523, 'total_flos': 743151283200000.0, 'train_loss': 2.215385637785259, 'epoch': 3.0})
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
***** Running Evaluation *****
Num examples = 630
Batch size = 8
Perplexity: 10.12
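As a quick check of where this number comes from: perplexity is the exponential of the average per-token cross-entropy loss. The sketch below assumes the final evaluation loss equals the epoch-3 validation loss shown in the training table (2.314481); that value is copied by hand rather than read from `eval_results`.

```python
import math

# Epoch-3 validation loss copied from the training table (assumed to match eval_loss).
eval_loss = 2.314481

# Perplexity = exp(average cross-entropy loss).
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.2f}")  # Perplexity: 10.12
```

Roughly speaking, a perplexity near 10 means the model is about as uncertain as if it were choosing uniformly among 10 tokens at each position.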
Testing
First a classic example
test = 'Premise 1: All men are mortal. Premise 2: Socrates is a man. $ Conclusion: '
input_ids = tokenizer(test, return_tensors='pt')['input_ids'].to(device)
output_greedy = model.generate(input_ids, max_length=25)
tokenizer.decode(output_greedy[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All men are mortal. Premise 2: Socrates is a man. $ Conclusion: Socrates is mortal'
I'm not sure why this warning appears; I believe it has something to do with my changing the model's built-in special tokens at the beginning. The message itself says that no attention mask or pad token id was set for generation. Either way, we can see that the model was able to generate the correct conclusion for this syllogism. This first example uses greedy search, where the model simply picks the most probable next token at each step from its probability distribution over the vocabulary.
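Greedy search can be sketched on a toy example. The vocabulary and probability table below are invented for illustration (they are not taken from the fine-tuned model), and `greedy_decode` is a hypothetical helper, not the library's implementation.

```python
import numpy as np

# Hypothetical toy "model": next-token probabilities depend only on the previous token.
vocab = ['Socrates', 'is', 'mortal', 'a', '<end>']
next_probs = {
    '<start>':  [0.90, 0.02, 0.03, 0.03, 0.02],
    'Socrates': [0.01, 0.90, 0.03, 0.03, 0.03],
    'is':       [0.02, 0.02, 0.55, 0.40, 0.01],
    'mortal':   [0.01, 0.01, 0.01, 0.01, 0.96],
    'a':        [0.05, 0.05, 0.60, 0.05, 0.25],
}

def greedy_decode(start='<start>', max_steps=10):
    tokens, prev = [], start
    for _ in range(max_steps):
        idx = int(np.argmax(next_probs[prev]))  # always take the most probable token
        tok = vocab[idx]
        if tok == '<end>':
            break
        tokens.append(tok)
        prev = tok
    return tokens

print(' '.join(greedy_decode()))  # Socrates is mortal
```

Because the choice at each step is final, greedy search never revisits an earlier token even if a different choice would have led to a more probable sentence overall.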
output_beam = model.generate(input_ids, max_length=25, num_beams=5)
tokenizer.decode(output_beam[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All men are mortal. Premise 2: Socrates is a man. $ Conclusion: Socrates is mortal'
Beam search keeps the n most probable partial sequences, or ‘beams’ (in this case 5), at each step, extends each of them by one token, and finally returns the highest-scoring complete sequence, rather than committing to a single token at every step. This output also looks good. Beam search will usually find a higher-probability sequence than greedy search.
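The difference from greedy search shows up when the most probable first token leads to a weaker continuation. Below is a minimal sketch on an invented two-step toy model; `next_token_probs` and `beam_search` are hypothetical names, and the real `generate` additionally handles end tokens and length penalties, which are left out here.

```python
import math

# Hypothetical toy model: next-token probabilities given the tokens so far.
def next_token_probs(prefix):
    if len(prefix) == 0:
        return {'A': 0.6, 'B': 0.4}
    if prefix[-1] == 'A':
        return {'x': 0.5, 'y': 0.5}
    return {'x': 0.9, 'y': 0.1}

def beam_search(num_beams, length=2):
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for tok, p in next_token_probs(prefix).items():
                candidates.append((prefix + (tok,), score + math.log(p)))
        # Keep only the num_beams highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]

# With a single beam this reduces to greedy search.
greedy_seq, greedy_score = beam_search(num_beams=1)
beam_seq, beam_score = beam_search(num_beams=2)

print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

Here greedy search commits to `A` (probability 0.6) and ends with a sequence of probability 0.3, while two beams keep `B` alive long enough to find `B x` with probability 0.36.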
The End
See below for more tests and search methods.
output_temp = model.generate(input_ids, max_length=25, do_sample=True, temperature=0.5)
tokenizer.decode(output_temp[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All men are mortal. Premise 2: Socrates is a man. $ Conclusion: Socrates is mortal'
output_topk = model.generate(input_ids, max_length=25, do_sample=True, top_k=50)
tokenizer.decode(output_topk[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All men are mortal. Premise 2: Socrates is a man. $ Conclusion: Socrates is a'
output_topp = model.generate(input_ids, max_length=25, do_sample=True, top_p=0.90)
tokenizer.decode(output_topp[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All men are mortal. Premise 2: Socrates is a man. $ Conclusion: Socrates is mortal'
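The three sampling calls above differ only in how they reshape the next-token distribution before drawing from it. The sketch below shows the idea on invented logits using NumPy; the helper names are hypothetical and this mirrors the general technique, not the library's exact implementation.

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def apply_temperature(logits, temperature):
    """Temperature < 1 sharpens the distribution, > 1 flattens it."""
    return softmax(np.asarray(logits, dtype=float) / temperature)

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[::-1][:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

logits = [2.0, 1.0, 0.5, -1.0]            # invented scores for a 4-token vocabulary
probs = softmax(logits)
sharp = apply_temperature(logits, 0.5)    # analogous to temperature=0.5
topk = top_k_filter(probs, 2)             # analogous to top_k, with k=2 for this tiny vocabulary
topp = top_p_filter(probs, 0.90)          # analogous to top_p=0.90

rng = np.random.default_rng(0)
token = rng.choice(len(topp), p=topp)     # sample one token id from the filtered distribution
print(probs.round(3), sharp.round(3), topk.round(3), topp.round(3), token)
```

In this toy case the least likely token is removed by both filters, so it can never be sampled, while lowering the temperature moves more probability onto the top token.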
test2 = 'Premise 1: All mammals are animals. Premise 2: All elephants are mammals. $ Conclusion: '
input_ids = tokenizer(test2, return_tensors='pt')['input_ids'].to(device)
output_greedy = model.generate(input_ids, max_length=50)
tokenizer.decode(output_greedy[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All mammals are animals. Premise 2: All elephants are mammals. $ Conclusion: All elephants are animals. <endoftext> '
output_beam = model.generate(input_ids, max_length=50, num_beams=5)
tokenizer.decode(output_beam[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All mammals are animals. Premise 2: All elephants are mammals. $ Conclusion: All elephants are animals. <endoftext> '
test3 = 'Premise 1: All mammals are warm-blooded. Premise 2: All black dogs are mammals. $ Conclusion: '
input_ids = tokenizer(test3, return_tensors='pt')['input_ids'].to(device)
output_beam = model.generate(input_ids, max_length=40, num_beams=5)
tokenizer.decode(output_beam[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All mammals are warm-blooded. Premise 2: All black dogs are mammals. $ Conclusion: All black dogs are warm-blooded. <endoftext> <endoftext> animal is warm-'